Sentiment Analysis
Before we can even begin to dive into analyzing text, we must first process the text. Processing text involves several steps that will be combined in various ways, depending on what we are trying to accomplish.
Tense aside, are chewed, chew, and chewing the same thing? Yes, but what if we compare the actual strings? On a string comparison side, are they the same? No. We have a string with 6, 4, and 7 characters, respectively.
What if we remove the suffixes, “ed” and “ing” – we are left with three instances of “chew”? Now we have something that is equivalent in meaning and in a string sense. This is the goal of stemming.
Let’s take a look to see how this works (you will need to install tm and SnowballC first):
chewStrings <- c("chew", "chewing", "chewed", "chewer")
tm::stemDocument(chewStrings)
[1] "chew" "chew" "chew" "chewer"
We got exactly what we expected, right? You might have noticed that “chewer” did not get stemmed. Do you have any idea why? Let’s think through it together. “Chew”, “chewing”, and “chewed” are all verbs related to the act of chewinging. “Chewer”, on the other hand, is a person who chews – it is a noun. Martin Porter’s stemming algorithm works incredibly well!
Hopefully, this makes conceptual sense; however, we also need to understand why we need to do it. In a great many text-based methods, we are going to create a matrix that keeps track of every term (i.e., word) in every document – this is known as a document-term matrix. If we know that “chew”, “chewing”, and “chewed” all refer to the same thing, we want it just represented once within our document-term matrix.
Shall we take a look?
library(tm)
documents <- c("I like to chew",
"I have chewed my whole life",
"Chewing and stomping through the fields",
"I am a chewer")
documentsCorp <- tm::SimpleCorpus(VectorSource(documents))
documentsDTM <- DocumentTermMatrix(documentsCorp)
inspect(documentsDTM)
<<DocumentTermMatrix (documents: 4, terms: 13)>>
Non-/sparse entries: 13/39
Sparsity : 75%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
Terms
Docs and chew chewed chewing fields have life like stomping whole
1 0 1 0 0 0 0 0 1 0 0
2 0 0 1 0 0 1 1 0 0 1
3 1 0 0 1 1 0 0 0 1 0
4 0 0 0 0 0 0 0 0 0 0
We can see that without stemming, we have 9 terms (things like “I”, “a”, and “to” get removed automatically). Let’s do some stemming now:
documentsStemmed <- stemDocument(documents)
documentsStemmed
[1] "I like to chew"
[2] "I have chew my whole life"
[3] "Chew and stomp through the field"
[4] "I am a chewer"
And now the document-term matrix:
stemmedDocCorp <- tm::SimpleCorpus(VectorSource(documentsStemmed))
stemmedDocDTM <- DocumentTermMatrix(stemmedDocCorp)
inspect(stemmedDocDTM)
<<DocumentTermMatrix (documents: 4, terms: 11)>>
Non-/sparse entries: 13/31
Sparsity : 70%
Maximal term length: 7
Weighting : term frequency (tf)
Sample :
Terms
Docs and chew field have life like stomp the through whole
1 0 1 0 0 0 1 0 0 0 0
2 0 1 0 1 1 0 0 0 0 1
3 1 1 1 0 0 0 1 1 1 0
4 0 0 0 0 0 0 0 0 0 0
If we are trying to find documents that are covering similar content or talking about similar things, this document-term matrix will help to draw better conclusions, because it is clear that the first three documents are talking about the act of chewing and this document-term matrix reflects that.
Stemming is often sufficient (and most modern stemmers work pretty well on their own). Still, stemming is slightly more akin to amputating an arm with a battle ax – it works, but it is brute force. Lemmatization is a more sophisticated approach. You might have already guessed that lemmatization will find the lemma of a word and since you likely know about morphology, you already know that the lemma of a word is its canonical form. A group of words that form the same idea are called a lexeme (am, be, are are all within the same lexeme). Generally, the smallest form of the word is chosen as the lemma. This is a really interesting area of linguistics, but we don’t need to dive fully in.
Instead, let’s see it in action.
If we compare some “chewing” stuff on stemming and lemmatization, we can see what we get:
library(textstem)
chewStrings <- c("chew", "chewing", "chewed", "chewer")
stem_words(chewStrings)
[1] "chew" "chew" "chew" "chewer"
lemmatize_words(chewStrings)
[1] "chew" "chew" "chew" "chewer"
Absolutely nothing different. Both stemming and lemmatizing will perform the same task. The act of chewing is comprised of a past, present, and future tense, and chew is the lemma; chewer is still seen as something else entirely.
But let’s take a look at something different. If we have a string of the most lovely words, what might happen?
lovelyString <- c("lovely", "lovelier", "loveliest")
stem_words(lovelyString)
[1] "love" "loveli" "loveliest"
That is about as close to “bigly” nonsense as we could possibly get without going into Dr. Suess mode.
But if we try lemmatization:
lemmatize_words(lovelyString)
[1] "lovely" "lovely" "lovely"
We get something that starts to make sense. Now, let’s try these on some actual chunks of text and see what happens.
# This data is in the "data" folder on Sakai!
if(Sys.info()["sysname"] == "Darwin") {
load("~/courses/unstructured/data/allLyricsDF.RData")
} else {
load("C:/Users/sberry5/Documents/teaching/courses/unstructured/data/allLyricsDF.RData")
}
sampleLyrics <- allLyricsDF[40, ]
sampleLyrics$lyrics
[1] \n \n \n [Verse 1]\nShe loved him like he was\nThe last man on Earth\nGave him everything she ever had\nHe'd break her spirit down\nThen come loving up on her\nGive a little, then take it back\nShe'd tell him about her dreams\nHe'd just shoot 'em down\nLord he loved to make her cry\n"You're crazy for believing\nYou'll ever leave the ground"\nHe said, "Only angels know how to fly"\n[Chorus]\nAnd with a broken wing\nShe still sings\nShe keeps an eye on the sky\nWith a broken wing\nShe carries her dreams\nMan, you ought to see her fly\n[Verse 2]\nOne Sunday morning\nShe didn't go to church\nHe wondered why she didn't leave\nHe went up to the bedroom\nFound a note by the window\nWith the curtains blowin' in the breeze\n[Chorus]\nAnd with a broken wing\nShe still sings\nShe keeps an eye on the sky\nWith a broken wing\nShe carries her dreams\nMan, you ought to see her fly\n[Outro]\nWith a broken wing\nShe carries her dreams\nMan, you ought to see her\nFlyyyyyyyyyyyyy\nWith a broken wing\n\n\n \n \n
3106 Levels: \n \n \n I'll need time to get you off my mind\nAnd I may sometimes bother you\nTry to be in touch with you\nEven ask too much of you from time to time\nNow and then\nLord, you know I'll need a friend\nAnd 'till I get used to losing you\nLet me keep on using you\n'Til I can make it on my own\nI'll get by, but no matter how I try\nThere'll be times that you'll know I'll call\nChances are my tears will fall\nAnd I'll have no pride at all, from time to time\nBut they say, oh, there'll be a brighter day\nBut 'til then I lean on you\nThat's all I mean to do\n'Til I can make it on my own\nSurely someday I'll look up and see the morning sun\nWithout another lonely night behind me\nThen I'll know I'm over you and all my crying's done\nNo more hurtin' memories can find me\nBut 'til then\nLord, You know I'm gonna need a friend\n'Til I get used to losing you\nLet me keep on using you\n'Til I can make it on my own\n'Til I can make it on my own\n\n\n \n \n ...
Of course, we will need to do some cleaning on our text first:
library(dplyr)
library(stringr)
cleanLyrics <- sampleLyrics$lyrics %>%
str_replace_all(., "\n", " ") %>%
str_replace_all(., "\\[[A-Za-z]+\\s*[0-9]*]", "") %>%
str_squish(.) %>%
gsub("([a-z])([A-Z])", "\\1 \\2", .)
You can take it for a spin – compare what comes from stemming and lemmatizing:
Here is something very interesting:
microbenchmark::microbenchmark(stem_strings(cleanLyrics),
lemmatize_strings(cleanLyrics))
What does this tell you?
The question, then, is what do you decide to do. For my money, lemmatization does a better job and getting words down to their actual meaning.
Some words do us very little good: articles, prepositions, and very high frequency words. These are all words that need to be removed. Fortunately, you don’t have to do this on your own – a great many dictionaries exist that contain words ready for removal.
tm::stopwords("en")
[1] "i" "me" "my" "myself" "we"
[6] "our" "ours" "ourselves" "you" "your"
[11] "yours" "yourself" "yourselves" "he" "him"
[16] "his" "himself" "she" "her" "hers"
[21] "herself" "it" "its" "itself" "they"
[26] "them" "their" "theirs" "themselves" "what"
[31] "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being"
[46] "have" "has" "had" "having" "do"
[51] "does" "did" "doing" "would" "should"
[56] "could" "ought" "i'm" "you're" "he's"
[61] "she's" "it's" "we're" "they're" "i've"
[66] "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll"
[76] "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
[86] "haven't" "hadn't" "doesn't" "don't" "didn't"
[91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
[96] "cannot" "couldn't" "mustn't" "let's" "that's"
[101] "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an"
[111] "the" "and" "but" "if" "or"
[116] "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about"
[126] "against" "between" "into" "through" "during"
[131] "before" "after" "above" "below" "to"
[136] "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again"
[146] "further" "then" "once" "here" "there"
[151] "when" "where" "why" "how" "all"
[156] "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no"
[166] "nor" "not" "only" "own" "same"
[171] "so" "than" "too" "very"
Removing stopwords takes little effort!
documents = c("I like to chew.",
"I am stompin and chewing.",
"Chewing is in my blood.",
"I am a chewer")
tm::removeWords(documents, words = stopwords("en"))
[1] "I like chew." "I stompin chewing."
[3] "Chewing blood." "I chewer"
We can even include custom stopwords:
tm::removeWords(documents, words = c("blood", stopwords("en")))
[1] "I like chew." "I stompin chewing."
[3] "Chewing ." "I chewer"
There are many different stopword lists out there, so you might want to poke around just a little bit to find something that will suit the needs of a particular project.
library(stopwords)
Applied to our previous song, here is what we would get:
tm::removeWords(cleanLyrics, words = stopwords("en"))
[1] "She loved like The last man Earth Gave everything ever He'd break spirit Then come loving Give little, take back She'd tell dreams He'd just shoot 'em Lord loved make cry \"You're crazy believing You'll ever leave ground\" He said, \"Only angels know fly\" And broken wing She still sings She keeps eye sky With broken wing She carries dreams Man, see fly One Sunday morning She go church He wondered leave He went bedroom Found note window With curtains blowin' breeze And broken wing She still sings She keeps eye sky With broken wing She carries dreams Man, see fly With broken wing She carries dreams Man, see Flyyyyyyyyyyyyy With broken wing"
Now, let’s use the textclean package to handle contraction replacement:
replacedText <- textclean::replace_contraction(cleanLyrics)
tm::removeWords(replacedText, words = stopwords("en"))
[1] "She loved like The last man Earth Gave everything ever break spirit Then come loving Give little, take back She tell dreams just shoot 'em Lord loved make cry \" crazy believing ever leave ground\" He said, \"Only angels know fly\" And broken wing She still sings She keeps eye sky With broken wing She carries dreams Man, see fly One Sunday morning She go church He wondered leave He went bedroom Found note window With curtains blowin' breeze And broken wing She still sings She keeps eye sky With broken wing She carries dreams Man, see fly With broken wing She carries dreams Man, see Flyyyyyyyyyyyyy With broken wing"
There are several great functions in textclean – I highly suggest you check it out.
And one final point to make:
gsub('"', "", replacedText)
[1] "She loved him like he was The last man on Earth Gave him everything she ever had he would break her spirit down Then come loving up on her Give a little, then take it back She would tell him about her dreams he would just shoot 'em down Lord he loved to make her cry you are crazy for believing you will ever leave the ground He said, Only angels know how to fly And with a broken wing She still sings She keeps an eye on the sky With a broken wing She carries her dreams Man, you ought to see her fly One Sunday morning She did not go to church He wondered why she did not leave He went up to the bedroom Found a note by the window With the curtains blowin' in the breeze And with a broken wing She still sings She keeps an eye on the sky With a broken wing She carries her dreams Man, you ought to see her fly With a broken wing She carries her dreams Man, you ought to see her Flyyyyyyyyyyyyy With a broken wing"
There are several R packages that will help us process text. The tm package is popular and automates most of our work. You already saw how we use the stemming and stopword removal functions, but tm is full of fun stuff and allows for one pass text processing.
documents <- c("I like to chew.",
"I am stompin and chewing.",
"Chewing is in my blood.",
"I am a chewer")
documentCorp <- SimpleCorpus(VectorSource(documents))
stopWordRemoval <- function(x) {
removeWords(x, stopwords("en"))
}
textPrepFunctions <- list(tolower,
removePunctuation,
lemmatize_strings,
stopWordRemoval,
removeNumbers,
stripWhitespace)
documentCorp <- tm_map(documentCorp, FUN = tm_reduce, tmFuns = textPrepFunctions)
documentCorp[1][[1]]$content
Once you get your text tidied up (or even before), you can produce some visualizations!
library(tidytext)
library(wordcloud2)
allLyricsDF %>%
dplyr::filter(stringDistance < .2) %>%
dplyr::select(lyrics, returnedArtistName) %>%
mutate(lyrics = as.character(lyrics),
lyrics = str_replace_all(lyrics, "\n", " "),
lyrics = str_replace_all(lyrics, "\\[[A-Za-z]+\\s*[0-9]*]", ""),
lyrics = str_squish(lyrics),
lyrics = gsub("([a-z])([A-Z])", "\\1 \\2", lyrics)) %>%
unnest_tokens(word, lyrics) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(n > 25) %>%
na.omit() %>%
wordcloud2(shape = "cardioid")
Sentiment analysis is commonly used when we want to know the general feelings of what someone has written or said. Sentiment analysis is commonly seen applied to Twitter and other social media posts, but we can use it anywhere where people have written/said something (product reviews, song lyrics, final statements).
Sentiment can take many different forms: positive/negative affect, emotional states, and even financial contexts.
Let’s take a peak at some simple sentiment analysis.
Let’s consider the following statements:
library(tidytext)
statement <- "I dislike beer, but I really love the shine."
tokens <- data_frame(text = statement) %>%
unnest_tokens(tbl = ., output = word, input = text)
tokens
# A tibble: 9 x 1
word
<chr>
1 i
2 dislike
3 beer
4 but
5 i
6 really
7 love
8 the
9 shine
Now, we can compare the tokens within our statement to some pre-defined dictionary of positive and negative words.
library(tidyr)
tokens %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
pivot_wider(values_from = n, names_from = sentiment) %>%
mutate(sentiment = positive - negative)
# A tibble: 1 x 3
negative positive sentiment
<int> <int> <int>
1 1 2 1
When we use Bing’s dictionary, we see that we get one positive word (love) and negative word (dislike) with a neutral overall sentiment (a sentiment of 0 would indicate neutrality, while anything above 0 has an increasing amount of positivity and anything below 0 has an increasing amount of negativity).
Do you think that disklike and love are of the same magnitude? If I had to make a wild guess, I might say that love is stronger than dislike. Let’s switch out our sentiment library to get something with a little better notion of polarity magnitute.
tokens %>%
inner_join(get_sentiments("afinn"))
# A tibble: 2 x 2
word value
<chr> <dbl>
1 dislike -2
2 love 3
Now this looks a bit more interesting! “Love” has a stronger positive polarity than “dislike” has negative polarity. So, we could guess that we would have some positive sentiment.
If we divide the sum of our word sentiments by the number of words within the dictionary, we should get an idea of our sentences overall sentiment.
tokens %>%
inner_join(get_sentiments("afinn")) %>%
summarize(n = nrow(.), sentSum = sum(value)) %>%
mutate(sentiment = sentSum / n)
# A tibble: 1 x 3
n sentSum sentiment
<int> <dbl> <dbl>
1 2 1 0.5
Our sentiment of .5 tells us that our sentence is positive, even if only slightly so.
While these simple sentiment analyses provide some decent measures to the sentiment of our text, we are ignoring big chunks of our text by just counting keywords.
For example, it is probably fair to say that “really love” is stronger than just “love”. We might want to switch over to some techniques that consider n-grams and other text features to calculate sentiment.
When we use sentiment analysis that is aware of context, valence (“love” is stronger than “like”), modifiers (e.g., “really love”), and adversative statements (“but,…”, “however,…”), we get a better idea about the real sentiment of the text.
We will use the jockers sentiment library, but many more available. Depending on your exact needs, there are some dictionaries designed for different applications.
Before we engage in our whole sentiment analysis, let’s take a look at a few things.
Here is the dictionary that jockers will use.
lexicon::hash_sentiment_jockers
x y
1: abandon -0.75
2: abandoned -0.50
3: abandoner -0.25
4: abandonment -0.25
5: abandons -1.00
---
10734: zealous 0.40
10735: zenith 0.40
10736: zest 0.50
10737: zombie -0.25
10738: zombies -0.25
You might want to use View() to get a complete look at what is happening in there.
We should also take a peak at our valence shifters:
lexicon::hash_valence_shifters
x y
1: absolutely 2
2: acute 2
3: acutely 2
4: ain't 1
5: aint 1
---
136: whereas 4
137: won't 1
138: wont 1
139: wouldn't 1
140: wouldnt 1
With all of that out of the way, let’s get down to the matter at hand:
library(sentimentr)
library(lexicon)
library(magrittr)
statement = "I dislike beer, but I really love the shine."
sentiment(statement, polarity_dt = lexicon::hash_sentiment_jockers)
element_id sentence_id word_count sentiment
1: 1 1 9 0.9375
We can see that we get a much stronger sentiment score when we include more information within the sentence. While the first part of our sentence starts out with a negative word (dislike has a sentiment value of -1), we have an adversarial “but” that will downweight whatever is in the initial phrase and then we will have the amplified (from “really”, with a default weight of .8) sentiment of “love” (with a weight of .75 in our dictionary).
With all of this together, we get a much better idea about the sentiment of our text.
There are also some handy functions within sentimentr:
extractedTerms <- extract_sentiment_terms(statement, polarity_dt = lexicon::hash_sentiment_jockers)
attributes(extractedTerms)$counts
words polarity n
1: <NA> 0 1
attributes(extractedTerms)$elements
element_id sentence_id words polarity
1: 1 1 <NA> 0
While the text that we have used so far serves its purpose as an example quite well, we can always take a look at other written words.
load(url("https://raw.githubusercontent.com/saberry/courses/master/hash_sentiment_vadar.RData"))
cleanLyrics <- allLyricsDF %>%
filter(stringDistance < .2) %>%
dplyr::select(lyrics, returnedArtistName, returnedSong) %>%
mutate(lyrics = as.character(lyrics),
lyrics = str_replace_all(lyrics, "\n", " "),
lyrics = str_replace_all(lyrics, "(\\[.*?\\])", ""), # look different?
lyrics = str_squish(lyrics),
lyrics = gsub("([a-z])([A-Z])", "\\1 \\2", lyrics))
songSentiment <- sentiment(get_sentences(cleanLyrics),
polarity_dt = hash_sentiment_vadar) %>%
group_by(returnedSong) %>%
summarize(meanSentiment = mean(sentiment)) # Check sentimentr for better options!
Naturally, we would want to join those sentiment values up with our original data:
cleanLyrics <- left_join(cleanLyrics, songSentiment, by = "returnedSong")
From here, we have several choices in front of us. One, we could use those sentiment values within a model (e.g., we might want to predict charting position). Or, we could use them for some further exploration:
library(DT)
sentimentBreaks = c(-1.7, -.5, 0, .5, 1.7)
breakColors = c('rgb(178,24,43)','rgb(239,138,98)','rgb(253,219,199)','rgb(209,229,240)','rgb(103,169,207)','rgb(33,102,172)')
datatable(cleanLyrics, rownames = FALSE,
options = list(pageLength = 15, escape = FALSE,
columnDefs = list(list(targets = 1, visible = FALSE)))) %>%
formatStyle("lyrics", "meanSentiment", backgroundColor = styleInterval(sentimentBreaks, breakColors))
We can also do some checking over time:
library(ggplot2)
if(Sys.info()["sysname"] == "Darwin") {
load("~/courses/unstructured/data/countryTop50.RData")
} else {
load("C:/Users/sberry5/Documents/teaching/courses/unstructured/data/countryTop50.RData")
}
allTop50 <- allTop50 %>%
group_by(song) %>%
slice(1)
cleanLyrics <- left_join(cleanLyrics, allTop50, by = c("returnedSong" = "song"))
cleanLyrics %>%
group_by(date) %>%
na.omit() %>%
summarize(meanSentiment = mean(meanSentiment)) %>%
ggplot(., aes(date, meanSentiment)) +
geom_point() +
theme_minimal()
That is pretty messy (but I am curious about that really happy month), so let’s try something else:
library(gganimate)
animatedYears <- cleanLyrics %>%
mutate(year = lubridate::year(date),
month = lubridate::month(date)) %>%
group_by(year, month, date) %>%
na.omit() %>%
summarize(meanSentiment = mean(meanSentiment)) %>%
ggplot(., aes(as.factor(month), meanSentiment, color = meanSentiment)) +
geom_point() +
scale_color_distiller(type = "div") +
theme_minimal() +
transition_states(year,
transition_length = length(1975:2019),
state_length = 3) +
ggtitle('Year: {closest_state}')
animate(animatedYears, fps = 5)
cleanLyrics %>%
mutate(year = lubridate::year(date)) %>%
group_by(year) %>%
na.omit() %>%
summarize(meanSentiment = mean(meanSentiment)) %>%
ggplot(., aes(year, meanSentiment)) +
geom_col() +
theme_minimal()
Sentiment analysis is always a handy tool to have around. You might also want to explore other descriptive aspects of your text.
The koRpus package allows for all types of interesting types descriptives. There are a great number of readability and lexical diversity statistics (Fucks is likely my favorite).
We need to tokenize our text in a manner that will please koRpus.
library(koRpus)
readability(
tokenize(
cleanLyrics$lyrics[27], format = "obj", lang = "en"
), quiet = TRUE
)
Automated Readability Index (ARI)
Parameters: default
Grade: 183.5
Coleman-Liau
Parameters: default
ECP: 67% (estimted cloze percentage)
Grade: 4.63
Grade: 4.63 (short formula)
Danielson-Bryan
Parameters: default
DB1: 37.19
DB2: -246.93
Grade: >= 13 (college)
Dickes-Steiwer's Handformel
Parameters: default
TTR: 0.35
Score: 34.37
Easy Listening Formula
Parameters: default
Exsyls: 59
Score: 59
Farr-Jenkins-Paterson
Parameters: default
RE: -279.3
Grade: >= 16 (college graduate)
Flesch Reading Ease
Parameters: en (Flesch)
RE: -274.56
Grade: >= 16 (college graduate)
Flesch-Kincaid Grade Level
Parameters: default
Grade: 145.21
Age: 150.21
Gunning Frequency of Gobbledygook (FOG)
Parameters: default
Grade: 151.22
FORCAST
Parameters: default
Grade: 7.35
Age: 12.35
Fucks' Stilcharakteristik
Score: 1315
Grade: 36.26
Linsear Write
Parameters: default
Easy words: 98.94
Hard words: 1.06
Grade: 192.5
Läsbarhetsindex (LIX)
Parameters: default
Index: 382.04
Rating: very difficult
Grade: > 11
Neue Wiener Sachtextformeln
Parameters: default
nWS 1: 60.26
nWS 2: 61.54
nWS 3: 71.02
nWS 4: 98.73
Readability Index (RIX)
Parameters: default
Index: 19
Grade: > 12 (college)
Simple Measure of Gobbledygook (SMOG)
Parameters: default
Grade: 14.55
Age: 19.55
Strain Index
Parameters: default
Index: 132
Kuntzsch's Text-Redundanz-Index
Parameters: default
Short words: 318
Punctuation: 36
Foreign: 0
Score: 39.55
Tuldava's Text Difficulty Formula
Parameters: default
Index: 6.92
Wheeler-Smith
Parameters: default
Score: 590
Grade: > 4
Text language: en
For other lingual measures, you can check out the quanteda package:
library(quanteda)
tokens(cleanLyrics$lyrics[27]) %>%
textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
document TTR CTTR K
1 text1 0.374269 4.894202 205.5333
For TTR, consider 0 to be the same word on repeat and 1 to be no word repitition.
library(rvest)
choctawBingoLyrics <- read_html("https://genius.com/James-mcmurtry-choctaw-bingo-lyrics") %>%
html_nodes(".lyrics") %>%
html_text() %>%
str_replace_all(., "\n", " ") %>%
str_replace_all(., "\\[\\w+\\s*\\w*\\]", "") %>%
str_squish(.) %>%
gsub("([a-z])([A-Z])", "\\1 \\2", .)
tokens(choctawBingoLyrics) %>%
textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
document TTR CTTR K
1 text1 0.4774775 8.713146 87.24941
English is hard and seeing how words relate to other words can be tricky. Let’s think through something weird:
\[\text{church} - \text{jesus} + \text{muhammad} = \text{mosque}\]
Or how about:
\[\text{Washing D.C.} - \text{America} + \text{Mexico} = \text{Mexico City}\]
What is the purpose here? Word embeddings start to break down how words can be different, but still deal in the same contextual space. Whether we are talking about a church or a mosque, we are just dealing with a place of worship. If we have two different text documents (one talking about mosques and one talking about churches), it would be nice to be able to recognize that they are largely talking about the same idea.
library(text2vec)
links <- c("https://en.wikipedia.org/wiki/Christianity",
"https://en.wikipedia.org/wiki/Islam",
"https://en.wikipedia.org/wiki/Muslims",
"https://en.wikipedia.org/wiki/Jesus",
"https://en.wikipedia.org/wiki/Muhammad",
"https://en.wikipedia.org/wiki/Quran",
"https://en.wikipedia.org/wiki/Bible")
allText <- lapply(links, function(x) {
read_html(x) %>%
html_nodes("p") %>%
html_text() %>%
gsub("\\[[0-9]*\\]|[[:punct:]]", " ", .) %>%
stringr::str_squish(.) %>%
tolower(.) %>%
tm::removeWords(., tm::stopwords("en"))
})
tokens <- space_tokenizer(unlist(allText))
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 20L)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 10L)
glove = GlobalVectors$new(rank = 50, x_max = 10)
glove_main <- glove$fit_transform(tcm, n_iter = 250, quiet = TRUE)
INFO [05:26:35.292] epoch 1, loss 0.1697
INFO [05:26:35.338] epoch 2, loss 0.0916
INFO [05:26:35.369] epoch 3, loss 0.0710
INFO [05:26:35.394] epoch 4, loss 0.0592
INFO [05:26:35.418] epoch 5, loss 0.0511
INFO [05:26:35.442] epoch 6, loss 0.0451
INFO [05:26:35.466] epoch 7, loss 0.0404
INFO [05:26:35.490] epoch 8, loss 0.0367
INFO [05:26:35.514] epoch 9, loss 0.0337
INFO [05:26:35.539] epoch 10, loss 0.0311
INFO [05:26:35.563] epoch 11, loss 0.0290
INFO [05:26:35.587] epoch 12, loss 0.0271
INFO [05:26:35.611] epoch 13, loss 0.0255
INFO [05:26:35.635] epoch 14, loss 0.0241
INFO [05:26:35.659] epoch 15, loss 0.0229
INFO [05:26:35.684] epoch 16, loss 0.0218
INFO [05:26:35.708] epoch 17, loss 0.0208
INFO [05:26:35.732] epoch 18, loss 0.0199
INFO [05:26:35.757] epoch 19, loss 0.0191
INFO [05:26:35.781] epoch 20, loss 0.0184
INFO [05:26:35.805] epoch 21, loss 0.0178
INFO [05:26:35.829] epoch 22, loss 0.0171
INFO [05:26:35.853] epoch 23, loss 0.0166
INFO [05:26:35.877] epoch 24, loss 0.0161
INFO [05:26:35.902] epoch 25, loss 0.0156
INFO [05:26:35.926] epoch 26, loss 0.0151
INFO [05:26:35.949] epoch 27, loss 0.0147
INFO [05:26:35.973] epoch 28, loss 0.0143
INFO [05:26:35.998] epoch 29, loss 0.0140
INFO [05:26:36.021] epoch 30, loss 0.0136
INFO [05:26:36.045] epoch 31, loss 0.0133
INFO [05:26:36.069] epoch 32, loss 0.0130
INFO [05:26:36.092] epoch 33, loss 0.0127
INFO [05:26:36.115] epoch 34, loss 0.0125
INFO [05:26:36.141] epoch 35, loss 0.0122
INFO [05:26:36.164] epoch 36, loss 0.0120
INFO [05:26:36.186] epoch 37, loss 0.0117
INFO [05:26:36.213] epoch 38, loss 0.0115
INFO [05:26:36.236] epoch 39, loss 0.0113
INFO [05:26:36.258] epoch 40, loss 0.0111
INFO [05:26:36.284] epoch 41, loss 0.0109
INFO [05:26:36.307] epoch 42, loss 0.0107
INFO [05:26:36.330] epoch 43, loss 0.0106
INFO [05:26:36.354] epoch 44, loss 0.0104
INFO [05:26:36.380] epoch 45, loss 0.0102
INFO [05:26:36.402] epoch 46, loss 0.0101
INFO [05:26:36.427] epoch 47, loss 0.0099
INFO [05:26:36.453] epoch 48, loss 0.0098
INFO [05:26:36.476] epoch 49, loss 0.0096
INFO [05:26:36.501] epoch 50, loss 0.0095
INFO [05:26:36.529] epoch 51, loss 0.0094
INFO [05:26:36.553] epoch 52, loss 0.0093
INFO [05:26:36.578] epoch 53, loss 0.0091
INFO [05:26:36.607] epoch 54, loss 0.0090
INFO [05:26:36.630] epoch 55, loss 0.0089
INFO [05:26:36.657] epoch 56, loss 0.0088
INFO [05:26:36.684] epoch 57, loss 0.0087
INFO [05:26:36.708] epoch 58, loss 0.0086
INFO [05:26:36.732] epoch 59, loss 0.0085
INFO [05:26:36.757] epoch 60, loss 0.0084
INFO [05:26:36.782] epoch 61, loss 0.0083
INFO [05:26:36.806] epoch 62, loss 0.0082
INFO [05:26:36.831] epoch 63, loss 0.0081
INFO [05:26:36.854] epoch 64, loss 0.0081
INFO [05:26:36.878] epoch 65, loss 0.0080
INFO [05:26:36.903] epoch 66, loss 0.0079
INFO [05:26:36.927] epoch 67, loss 0.0078
INFO [05:26:36.951] epoch 68, loss 0.0077
INFO [05:26:36.975] epoch 69, loss 0.0077
INFO [05:26:36.999] epoch 70, loss 0.0076
INFO [05:26:37.022] epoch 71, loss 0.0075
INFO [05:26:37.047] epoch 72, loss 0.0074
INFO [05:26:37.071] epoch 73, loss 0.0074
INFO [05:26:37.095] epoch 74, loss 0.0073
INFO [05:26:37.119] epoch 75, loss 0.0073
INFO [05:26:37.142] epoch 76, loss 0.0072
INFO [05:26:37.165] epoch 77, loss 0.0071
INFO [05:26:37.190] epoch 78, loss 0.0071
INFO [05:26:37.213] epoch 79, loss 0.0070
INFO [05:26:37.236] epoch 80, loss 0.0070
INFO [05:26:37.263] epoch 81, loss 0.0069
INFO [05:26:37.286] epoch 82, loss 0.0068
INFO [05:26:37.309] epoch 83, loss 0.0068
INFO [05:26:37.335] epoch 84, loss 0.0067
INFO [05:26:37.359] epoch 85, loss 0.0067
INFO [05:26:37.381] epoch 86, loss 0.0066
INFO [05:26:37.405] epoch 87, loss 0.0066
INFO [05:26:37.432] epoch 88, loss 0.0065
INFO [05:26:37.454] epoch 89, loss 0.0065
INFO [05:26:37.478] epoch 90, loss 0.0064
INFO [05:26:37.504] epoch 91, loss 0.0064
INFO [05:26:37.527] epoch 92, loss 0.0064
INFO [05:26:37.549] epoch 93, loss 0.0063
INFO [05:26:37.575] epoch 94, loss 0.0063
INFO [05:26:37.598] epoch 95, loss 0.0062
INFO [05:26:37.622] epoch 96, loss 0.0062
INFO [05:26:37.649] epoch 97, loss 0.0061
INFO [05:26:37.671] epoch 98, loss 0.0061
INFO [05:26:37.694] epoch 99, loss 0.0061
INFO [05:26:37.720] epoch 100, loss 0.0060
INFO [05:26:37.744] epoch 101, loss 0.0060
INFO [05:26:37.766] epoch 102, loss 0.0059
INFO [05:26:37.793] epoch 103, loss 0.0059
INFO [05:26:37.816] epoch 104, loss 0.0059
INFO [05:26:37.837] epoch 105, loss 0.0058
INFO [05:26:37.864] epoch 106, loss 0.0058
INFO [05:26:37.887] epoch 107, loss 0.0058
INFO [05:26:37.909] epoch 108, loss 0.0057
INFO [05:26:37.935] epoch 109, loss 0.0057
INFO [05:26:37.958] epoch 110, loss 0.0057
INFO [05:26:37.980] epoch 111, loss 0.0056
INFO [05:26:38.003] epoch 112, loss 0.0056
INFO [05:26:38.029] epoch 113, loss 0.0056
INFO [05:26:38.052] epoch 114, loss 0.0055
INFO [05:26:38.076] epoch 115, loss 0.0055
INFO [05:26:38.102] epoch 116, loss 0.0055
INFO [05:26:38.126] epoch 117, loss 0.0054
INFO [05:26:38.150] epoch 118, loss 0.0054
INFO [05:26:38.178] epoch 119, loss 0.0054
INFO [05:26:38.202] epoch 120, loss 0.0054
INFO [05:26:38.227] epoch 121, loss 0.0053
INFO [05:26:38.255] epoch 122, loss 0.0053
INFO [05:26:38.280] epoch 123, loss 0.0053
INFO [05:26:38.307] epoch 124, loss 0.0053
INFO [05:26:38.332] epoch 125, loss 0.0052
INFO [05:26:38.356] epoch 126, loss 0.0052
INFO [05:26:38.381] epoch 127, loss 0.0052
INFO [05:26:38.404] epoch 128, loss 0.0051
INFO [05:26:38.428] epoch 129, loss 0.0051
INFO [05:26:38.452] epoch 130, loss 0.0051
INFO [05:26:38.477] epoch 131, loss 0.0051
INFO [05:26:38.500] epoch 132, loss 0.0051
INFO [05:26:38.524] epoch 133, loss 0.0050
INFO [05:26:38.549] epoch 134, loss 0.0050
INFO [05:26:38.573] epoch 135, loss 0.0050
INFO [05:26:38.597] epoch 136, loss 0.0050
INFO [05:26:38.621] epoch 137, loss 0.0049
INFO [05:26:38.645] epoch 138, loss 0.0049
INFO [05:26:38.669] epoch 139, loss 0.0049
INFO [05:26:38.693] epoch 140, loss 0.0049
INFO [05:26:38.717] epoch 141, loss 0.0048
INFO [05:26:38.740] epoch 142, loss 0.0048
INFO [05:26:38.765] epoch 143, loss 0.0048
INFO [05:26:38.789] epoch 144, loss 0.0048
INFO [05:26:38.812] epoch 145, loss 0.0048
INFO [05:26:38.837] epoch 146, loss 0.0047
INFO [05:26:38.860] epoch 147, loss 0.0047
INFO [05:26:38.884] epoch 148, loss 0.0047
INFO [05:26:38.908] epoch 149, loss 0.0047
INFO [05:26:38.932] epoch 150, loss 0.0047
INFO [05:26:38.955] epoch 151, loss 0.0046
INFO [05:26:38.980] epoch 152, loss 0.0046
INFO [05:26:39.003] epoch 153, loss 0.0046
INFO [05:26:39.026] epoch 154, loss 0.0046
INFO [05:26:39.050] epoch 155, loss 0.0046
INFO [05:26:39.074] epoch 156, loss 0.0046
INFO [05:26:39.098] epoch 157, loss 0.0045
INFO [05:26:39.122] epoch 158, loss 0.0045
INFO [05:26:39.147] epoch 159, loss 0.0045
INFO [05:26:39.172] epoch 160, loss 0.0045
INFO [05:26:39.196] epoch 161, loss 0.0045
INFO [05:26:39.221] epoch 162, loss 0.0045
INFO [05:26:39.245] epoch 163, loss 0.0044
INFO [05:26:39.268] epoch 164, loss 0.0044
INFO [05:26:39.293] epoch 165, loss 0.0044
INFO [05:26:39.317] epoch 166, loss 0.0044
INFO [05:26:39.341] epoch 167, loss 0.0044
INFO [05:26:39.366] epoch 168, loss 0.0044
INFO [05:26:39.389] epoch 169, loss 0.0043
INFO [05:26:39.413] epoch 170, loss 0.0043
INFO [05:26:39.438] epoch 171, loss 0.0043
INFO [05:26:39.462] epoch 172, loss 0.0043
INFO [05:26:39.486] epoch 173, loss 0.0043
INFO [05:26:39.511] epoch 174, loss 0.0043
INFO [05:26:39.535] epoch 175, loss 0.0043
INFO [05:26:39.559] epoch 176, loss 0.0042
INFO [05:26:39.584] epoch 177, loss 0.0042
INFO [05:26:39.607] epoch 178, loss 0.0042
INFO [05:26:39.631] epoch 179, loss 0.0042
INFO [05:26:39.656] epoch 180, loss 0.0042
INFO [05:26:39.680] epoch 181, loss 0.0042
INFO [05:26:39.704] epoch 182, loss 0.0042
INFO [05:26:39.729] epoch 183, loss 0.0041
INFO [05:26:39.753] epoch 184, loss 0.0041
INFO [05:26:39.776] epoch 185, loss 0.0041
INFO [05:26:39.801] epoch 186, loss 0.0041
INFO [05:26:39.826] epoch 187, loss 0.0041
INFO [05:26:39.850] epoch 188, loss 0.0041
INFO [05:26:39.874] epoch 189, loss 0.0041
INFO [05:26:39.899] epoch 190, loss 0.0041
INFO [05:26:39.923] epoch 191, loss 0.0040
INFO [05:26:39.947] epoch 192, loss 0.0040
INFO [05:26:39.972] epoch 193, loss 0.0040
INFO [05:26:39.996] epoch 194, loss 0.0040
INFO [05:26:40.020] epoch 195, loss 0.0040
INFO [05:26:40.044] epoch 196, loss 0.0040
INFO [05:26:40.068] epoch 197, loss 0.0040
INFO [05:26:40.091] epoch 198, loss 0.0040
INFO [05:26:40.116] epoch 199, loss 0.0039
INFO [05:26:40.140] epoch 200, loss 0.0039
INFO [05:26:40.164] epoch 201, loss 0.0039
INFO [05:26:40.188] epoch 202, loss 0.0039
INFO [05:26:40.211] epoch 203, loss 0.0039
INFO [05:26:40.235] epoch 204, loss 0.0039
INFO [05:26:40.260] epoch 205, loss 0.0039
INFO [05:26:40.284] epoch 206, loss 0.0039
INFO [05:26:40.308] epoch 207, loss 0.0039
INFO [05:26:40.333] epoch 208, loss 0.0038
INFO [05:26:40.357] epoch 209, loss 0.0038
INFO [05:26:40.381] epoch 210, loss 0.0038
INFO [05:26:40.405] epoch 211, loss 0.0038
INFO [05:26:40.429] epoch 212, loss 0.0038
INFO [05:26:40.453] epoch 213, loss 0.0038
INFO [05:26:40.477] epoch 214, loss 0.0038
INFO [05:26:40.502] epoch 215, loss 0.0038
INFO [05:26:40.525] epoch 216, loss 0.0038
INFO [05:26:40.549] epoch 217, loss 0.0038
INFO [05:26:40.574] epoch 218, loss 0.0037
INFO [05:26:40.598] epoch 219, loss 0.0037
INFO [05:26:40.622] epoch 220, loss 0.0037
INFO [05:26:40.647] epoch 221, loss 0.0037
INFO [05:26:40.671] epoch 222, loss 0.0037
INFO [05:26:40.695] epoch 223, loss 0.0037
INFO [05:26:40.720] epoch 224, loss 0.0037
INFO [05:26:40.743] epoch 225, loss 0.0037
INFO [05:26:40.767] epoch 226, loss 0.0037
INFO [05:26:40.791] epoch 227, loss 0.0037
INFO [05:26:40.815] epoch 228, loss 0.0037
INFO [05:26:40.839] epoch 229, loss 0.0036
INFO [05:26:40.864] epoch 230, loss 0.0036
INFO [05:26:40.888] epoch 231, loss 0.0036
INFO [05:26:40.911] epoch 232, loss 0.0036
INFO [05:26:40.936] epoch 233, loss 0.0036
INFO [05:26:40.960] epoch 234, loss 0.0036
INFO [05:26:40.983] epoch 235, loss 0.0036
INFO [05:26:41.007] epoch 236, loss 0.0036
INFO [05:26:41.031] epoch 237, loss 0.0036
INFO [05:26:41.054] epoch 238, loss 0.0036
INFO [05:26:41.078] epoch 239, loss 0.0036
INFO [05:26:41.103] epoch 240, loss 0.0036
INFO [05:26:41.128] epoch 241, loss 0.0035
INFO [05:26:41.152] epoch 242, loss 0.0035
INFO [05:26:41.177] epoch 243, loss 0.0035
INFO [05:26:41.201] epoch 244, loss 0.0035
INFO [05:26:41.226] epoch 245, loss 0.0035
INFO [05:26:41.250] epoch 246, loss 0.0035
INFO [05:26:41.274] epoch 247, loss 0.0035
INFO [05:26:41.298] epoch 248, loss 0.0035
INFO [05:26:41.323] epoch 249, loss 0.0035
INFO [05:26:41.347] epoch 250, loss 0.0035
glove_context <- glove$components
word_vectors <- glove_main + t(glove_context)
churchVec <- word_vectors["bible", , drop = FALSE] -
word_vectors["jesus", , drop = FALSE] +
word_vectors["muhammad", , drop = FALSE]
cos_sim <- sim2(x = word_vectors, y = churchVec, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)
muhammad bible mecca banu books ce medina
0.5809943 0.4536537 0.4057872 0.3936433 0.3564263 0.3491453 0.3110081
old prophet greek
0.3096064 0.3033618 0.3023961